Basic Statistics in R
RAdelaide 2025
Statistics in R
Introduction
R has its origins as a statistical analysis language (i.e. S)
- Purpose of this session is NOT to teach statistical theory
- I am a bioinformatician NOT a statistician
- Perform simple analyses in R
- Up to you to know what you’re doing
- Or talk to your usual statisticians & collaborators
Distributions
R comes with nearly every distribution
- Standard syntax for accessing each
Distributions
| Distribution | Density | Area Under Curve | Quantile | Random |
|---|---|---|---|---|
| Normal | dnorm() | pnorm() | qnorm() | rnorm() |
| T | dt() | pt() | qt() | rt() |
| Uniform | dunif() | punif() | qunif() | runif() |
| Exponential | dexp() | pexp() | qexp() | rexp() |
| \(\chi^2\) | dchisq() | pchisq() | qchisq() | rchisq() |
| Binomial | dbinom() | pbinom() | qbinom() | rbinom() |
| Poisson | dpois() | ppois() | qpois() | rpois() |
Distributions
- Also Beta, \(\Gamma\), Log-Normal, F, Geometric, Cauchy, Hypergeometric etc…
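The d/p/q/r prefix pattern works the same way for every distribution. A minimal sketch using the normal distribution functions (base R only):

```r
## The d/p/q/r prefixes are identical across distributions
dnorm(0)           # density: height of the bell-curve at x = 0
pnorm(1.96)        # area under the curve to the left of 1.96
qnorm(0.975)       # quantile: the value with 97.5% of the area below it
rnorm(5)           # five random draws from N(0, 1)

## p* and q* are inverses of each other
qnorm(pnorm(1.5))  # returns 1.5
```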
Distributions
## dnorm gives the classic bell-curve
tibble(
  x = seq(-4, 4, length.out = 1e3)
) %>%
  ggplot(aes(x, y = dnorm(x))) +
  geom_line(colour = "red")

## pnorm gives the area under the
## bell-curve (which sums to 1)
tibble(
  x = seq(-4, 4, length.out = 1e3)
) %>%
  ggplot(aes(x, y = pnorm(x))) +
  geom_line()
The T Distribution
- A T distribution looks very much like a Standard normal N(0, 1) but has heavier tails
- This allows for greater uncertainty in the tails
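The heavier tails are easy to see by comparing tail areas directly:

```r
## The t-distribution places more probability in the tails than N(0, 1)
pnorm(-3)        # P(Z < -3) under N(0, 1): about 0.0013
pt(-3, df = 5)   # P(T < -3) with 5 df: about 0.015, roughly tenfold larger
```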
Tests For Continuous Data
Data For This Session
We’ll use the pigs dataset from earlier
library(tidyverse)
library(scales)
library(car)
library(palmerpenguins)
theme_set(
theme_bw() + theme(plot.title = element_text(hjust = 0.5))
)
pigs <- file.path("data", "pigs.csv") %>%
  read_csv() %>%
  mutate(
    dose = fct(dose, levels = c("Low", "Med", "High")),
    supp = fct(supp)
  )
Data For This Session
pigs %>%
ggplot(
aes(x = dose, y = len, fill = supp)
) +
  geom_boxplot()
Pop Quiz
- A p-value is the probability of observing a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.
- In plain English, assuming there’s nothing interesting to see here, how likely are we to observe our result, or one even more extreme
- A p-value of 0.05 \(\implies\) about 1 in 20 times we’ll see something like this in a random sample
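The "1 in 20" interpretation can be checked by simulation. A minimal sketch: when \(H_0\) is true, about 5% of random samples still give p < 0.05:

```r
## Simulate 1000 t-tests where H0 is genuinely true (mu really is 0)
set.seed(100)
p <- replicate(1000, t.test(rnorm(20))$p.value)
mean(p < 0.05)  # close to 0.05, i.e. about 1 in 20
```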
t-tests
- Assumes normally distributed data
- \(t\)-tests always test \(H_0\) Vs \(H_A\)
- For data with exactly two groups
. . .
- The simplest test is on a simple vector
- Not particularly meaningful for our data
?t.test
t.test(pigs$len)
The true mean of the underlying distribution from which the vector is sampled is zero, i.e. \(\mu = 0\)
t-tests
When comparing the means of two vectors
\[ H_0: \mu_{1} = \mu_{2} \\ H_A: \mu_{1} \neq \mu_{2} \]
We could use two vectors (i.e. x & y)
vc <- dplyr::filter(pigs, supp == "VC")$len
oj <- dplyr::filter(pigs, supp == "OJ")$len
t.test(x = vc, y = oj)
. . .
No
t-tests
- An alternative is the R formula method: len~supp
  - Length is a response variable
  - Supplement is the predictor
- Can only use one predictor for a t-test
  - Otherwise it’s linear regression

t.test(len~supp, data = pigs)
Did this give the same results?
t-tests
- Do we think the variance is equal between the two groups?
pigs |> summarise(sd = sd(len), .by = supp)
# A tibble: 2 × 2
  supp     sd
  <fct> <dbl>
1 VC     8.27
2 OJ     6.61
. . .
- We can use Levene’s Test to formalise this
  - From the package car
- Bartlett’s test is very similar (bartlett.test())

leveneTest(len~supp, data = pigs)
t-tests
- Now we can assume equal variances
- By default, variances are assumed to be unequal
t.test(len~supp, data = pigs, var.equal = TRUE)
. . .
- If relevant, the confidence interval can also be adjusted
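For example, the conf.level argument widens or narrows the reported interval. A sketch using simulated data rather than the pigs dataset:

```r
## 99% intervals are wider than the default 95% intervals
set.seed(42)
x <- rnorm(30, mean = 5)
y <- rnorm(30, mean = 6)
t.test(x, y, var.equal = TRUE)$conf.int                     # default 95% CI
t.test(x, y, var.equal = TRUE, conf.level = 0.99)$conf.int  # wider 99% CI
```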
Wilcoxon Tests
- We assumed the above dataset was normally distributed:
What if it’s not?
. . .
- Non-parametric equivalent is the Wilcoxon Rank-Sum Test (aka Mann-Whitney)
. . .
- This assigns ranks to each value based on their value
- The test is then performed on ranks NOT the values
- Tied values can be problematic
- Test that the centre of each underlying distribution is the same
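The rank-based statistic can be reconstructed by hand. A minimal sketch with made-up, tie-free data: W is the sum of the ranks of the first sample, minus its smallest possible value:

```r
## wilcox.test works on ranks, not the raw values
x <- c(1.2, 3.4, 5.6)
y <- c(2.1, 4.3, 6.5)
## Rank all six values together, sum the ranks belonging to x,
## then subtract n_x * (n_x + 1) / 2
W <- sum(rank(c(x, y))[1:3]) - 3 * 4 / 2
W                                    # 3
unname(wilcox.test(x, y)$statistic)  # also 3
```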
wilcox.test(len~supp, data = pigs)
A Brief Comment
- Both of these are suitable for comparing two groups
- T-tests assume Normally Distributed Data underlies the random sample
- Are robust to some deviation from normality
- Data can sometimes be transformed (e.g. sqrt(), log() etc.)
. . .
- The Wilcoxon Rank Sum Test assumes nothing about the underlying distribution
- Much less powerful with small sample sizes
- Highly comparable at n \(\geq\) 30
- The package coin implements a range of non-parametric tests
Tests For Categorical Data
\(\chi^2\) Test
- Here we need counts and categories
- Commonly used in Observed Vs Expected
\[ H_0: \text{No association between groups and outcome}\\ H_A: \text{Association between groups and outcome} \]
Valid when all expected cell counts are > 5 (Cochran 1954)
\(\chi^2\) Test
pass <- matrix(
  c(25, 8, 6, 15), nrow = 2,
  dimnames = list(
    c("Attended", "Skipped"),
    c("Pass", "Fail"))
)
pass
         Pass Fail
Attended   25    6
Skipped     8   15
. . .
pass_chisq <- chisq.test(pass)
pass_chisq
Pearson's Chi-squared test with Yates' continuity correction
data: pass
X-squared = 9.8359, df = 1, p-value = 0.001711
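Cochran's rule can be checked directly, since the htest object returned by chisq.test() stores the expected counts:

```r
## Re-create the matrix and confirm all expected cell counts exceed 5
pass <- matrix(
  c(25, 8, 6, 15), nrow = 2,
  dimnames = list(c("Attended", "Skipped"), c("Pass", "Fail"))
)
chisq.test(pass)$expected           # expected counts under H0
all(chisq.test(pass)$expected > 5)  # TRUE, so chi-squared is appropriate
```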
Fisher’s Exact Test
- \(\chi^2\) tests became popular in the days of the printed tables
- We now have computers
- Fisher’s Exact Test is preferable in the cases of low cell counts
- (Or any other time you feel like it…)
- Same \(H_0\) as the \(\chi^2\) test
- Uses the hypergeometric distribution
fisher.test(pass)
Summary of Tests
- t.test(), wilcox.test()
- chisq.test(), fisher.test()
. . .
- shapiro.test(), bartlett.test(), car::leveneTest()
  - Tests for normality or homogeneity of variance
. . .
- binom.test(), poisson.test()
- kruskal.test(), ks.test()
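For example, binom.test() tests a proportion against an exact binomial distribution:

```r
## Is a coin fair if it gives 60 heads in 100 tosses?
binom.test(x = 60, n = 100, p = 0.5)$p.value  # ~0.057: no strong evidence
```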
htest Objects
- All produce objects of class htest
  - Is really a list
- Use names() to see what other values are returned

names(pass_chisq)
. . .
- Will vary slightly between tests
- Can usually extract p-values using test$p.value

pass_chisq$p.value
htest Objects
## Have a look at the list elements produced by fisher.test
fisher.test(pass) |> names()
[1] "p.value"     "conf.int"    "estimate"    "null.value"  "alternative"
[6] "method"      "data.name"
. . .
## Are these similar to those produced by t.test?
t.test(len~supp, data = pigs) |> names()
 [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"
 [6] "null.value"  "stderr"      "alternative" "method"      "data.name"
. . .
- There is a function print.htest() which organises the printout for us
- We’ll come back to methods later, but this is a common way for output to be produced
  - There will be a print function for objects of each class, i.e. print.class_of_object
Linear Regression
Linear Regression
We are trying to estimate a line with slope & intercept
\[ y = ax + b \]
. . .
Or
\[ y = \beta_0 + \beta_1 x \]
- \(y\) is the response variable
- \(x\) is the predictor variable
- Makes the most intuitive sense when both \(x\) & \(y\) are continuous
Linear Regression
Linear Regression always uses the R formula syntax y ~ x: y depends on x
- We use the function lm()
- Once we have our model, we can predict \(y\) based on \(x\) values
. . .
- We’ll use the penguins dataset for some exploration
Linear Regression
## Does bill length depend on body mass?
## Plot the predictor (body mass) on `x`
## The response (bill_length) goes on `y`
penguins |>
ggplot(
aes(body_mass_g, bill_length_mm)
) +
geom_point() +
  geom_smooth(method = "lm")
- There appears to be a fairly clear linear relationship
Linear Regression
- Bill length is the response variable (\(y\))
- Body Mass is the predictor variable (\(x\))
## Fit a linear regression model and save the output as an R object
bill_length_lm <- lm(bill_length_mm ~ body_mass_g, data = penguins)
bill_length_lm
Call:
lm(formula = bill_length_mm ~ body_mass_g, data = penguins)
Coefficients:
(Intercept) body_mass_g
26.898872 0.004051
Linear Regression
- The basic output is only minimally informative
- Often pass to the function summary() to present and interpret
## Use `summary` to view and interpret the fitted model
summary(bill_length_lm)
Call:
lm(formula = bill_length_mm ~ body_mass_g, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-10.1251 -3.0434 -0.8089 2.0711 16.1109
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.690e+01 1.269e+00 21.19 <2e-16 ***
body_mass_g 4.051e-03 2.967e-04 13.65 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.394 on 340 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.3542, Adjusted R-squared: 0.3523
F-statistic: 186.4 on 1 and 340 DF, p-value: < 2.2e-16
Linear Regression
- The code used is returned as Call:
- A brief summary of the residuals is returned
- The fitted values are returned in the Coefficients element
  - We have the estimate of the intercept & slope
- The standard error of the estimates
- A t-test testing \(H_0: \beta_i = 0\)
- The p-value for each t-test
- Additional model summary information
- \(R^2\) is the proportion of variance explained by the model
- Adjusted R-squared is the R-squared adjusted for the number of predictors
- Mostly relevant when comparing models where predictors vary
Coefficients
The (Intercept) term: this is what the bill length would be if a penguin weighed exactly 0g
- We’d probably expect this to be zero but they almost never are
- The Intercept is almost always significant
- What does this really tell us about the relationship between bill length and weight?
- Generally focussed on the relationship within the range of observed predictors
- No guaranteed linear relationship outside of this range
Coefficients
body_mass_g term?
This is how much we would expect the bill length to change for every one unit increase in the predictor
i.e. for every 1\(g\) increase in weight, the bill length would be expected to increase by about 0.0041mm
- The \(t\)-test here is highly relevant
- \(H_0\colon \beta_1 = 0~\) Vs \(~H_A\colon \beta_1 \neq 0\)
- Reject \(H_0 \implies\) there is an association between predictor & response
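A quick back-of-the-envelope check of what the slope means, using the body_mass_g estimate from the summary() output above:

```r
## Predicted extra bill length for a penguin 500g heavier
slope <- 0.004051   # body_mass_g estimate from the fitted model above
500 * slope         # about 2mm longer
```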
Linear Regression
- Points never lie exactly on the regression line \(\implies\) Why?
. . .
- We’re actually fitting the model
\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
- \(\beta_0 + \beta_1 x_i\) is the exact line co-ordinate (Intercept + slope*predictor)
- \(\epsilon_i\) is the vertical difference between the observed value and the fitted value
- Known as a residual
- Defined as \(\epsilon_i \sim \mathcal{N}(0, \sigma)\)
- Residual means the bit left over when we subtract the predicted value from the observed
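This "bit left over" definition can be verified with any fitted model. A sketch using the built-in cars dataset so it runs standalone:

```r
## Residuals are exactly observed minus fitted values
fit <- lm(dist ~ speed, data = cars)
all.equal(
  unname(resid(fit)),
  cars$dist - unname(fitted(fit))
)  # TRUE
```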
Linear Regression
Linear Regression formally has 4 key assumptions
- Linear relationship between predictor and response
- Mean of residuals is zero across the entire range
- Constant variance across the range of data (homoscedasticity)
- Residuals are normally distributed
- Independence of errors
- Three of these are represented in the definition \(\epsilon_i \sim \mathcal{N}(0, \sigma)\)
Linear Regression
- To check our fitted model, we should check the residuals to ensure \(\epsilon_i \sim \mathcal{N}(0, \sigma)\)
## Check the residuals. It will ask you the following
## Hit <Return> to see next plot:
## So follow the instructions and check each plot before proceeding
## There are 4 plots by default
plot(bill_length_lm)
Linear Regression
- Check the zero mean of \(\mathcal{N}(0, \sigma)\)
- Is this assumption satisfied across the range of the data?
Linear Regression
- Check the normality of \(\mathcal{N}(0, \sigma)\)
- The dashed line is the expected line from a normal distribution
- Is this assumption satisfied across the range of the data?
Linear Regression
- Check the constant variance of \(\mathcal{N}(0, \sigma)\)
- Is this assumption satisfied across the range of the data?
Linear Regression
- Checks if any points are exerting excessive ‘leverage’ on the model
- Beyond scope of today
Regression Diagnostic Plots
- All of these figures used base plotting functions
- Stepping through can be frustrating
- Especially when running an automated script
. . .
- Can plot all 4 at once using a simple trick
## Define a 2x2 grid, plotting figures in a row-wise manner
par(mfrow = c(2, 2))
## Plot the four regression diagnostic plots
plot(bill_length_lm)
## Reset the grid layout to be the default 1x1
par(mfrow = c(1, 1))
- par() is a function to set graphical parameters
- mfrow means ‘Multi-Figures By Row’
- c(2, 2) means plot in a 2x2 grid
- I rarely show these so no need for ggplot versions
Objects Of Class lm
- The linear model we fitted produced an object of class lm

class(bill_length_lm)
. . .
- This is also a list
## Inspect the actual object
glimpse(bill_length_lm)
Objects Of Class lm
- We can use the list structure to inspect the residuals manually
## Plot a histogram of the residuals
hist(bill_length_lm$residuals)
Objects Of Class lm
- We could even use the Shapiro-Wilk test for normality
shapiro.test(bill_length_lm$residuals)
Shapiro-Wilk normality test
data: bill_length_lm$residuals
W = 0.95439, p-value = 8.217e-09
. . .
- How can we interpret all of this?
- Maybe there’s a better model
Adding Terms
## Does bill length depend on body mass?
## Plot the predictor (body mass) on `x`
## The response (bill_length) goes on `y`
## Does each species need its own model?
penguins |>
ggplot(
aes(
body_mass_g, bill_length_mm,
colour = species
)
) +
geom_point() +
  geom_smooth(method = "lm")
- Do we think these slopes are the same?
- Do we think the intercepts are the same?
Adding Terms
## Include the species in the model
## NB: This will fit a separate intercept for each species
bill_length_sp_lm <- lm(bill_length_mm ~ species + body_mass_g, data = penguins)
summary(bill_length_sp_lm)
Call:
lm(formula = bill_length_mm ~ species + body_mass_g, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-6.8129 -1.6718 0.1336 1.4720 9.2902
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.492e+01 1.063e+00 23.443 < 2e-16 ***
speciesChinstrap 9.921e+00 3.511e-01 28.258 < 2e-16 ***
speciesGentoo 3.558e+00 4.858e-01 7.324 1.78e-12 ***
body_mass_g 3.749e-03 2.823e-04 13.276 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.403 on 338 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.808, Adjusted R-squared: 0.8063
F-statistic: 474 on 3 and 338 DF, p-value: < 2.2e-16
- There is already a large increase in R-squared so this model may fit better
Adding Terms
## Check the residuals after
## including a separate intercept
## for species
par(mfrow = c(2, 2))
plot(bill_length_sp_lm)
par(mfrow = c(1, 1))
Model Diagnostics
## Check the histogram of residuals
hist(bill_length_sp_lm$residuals, breaks = 20)
## Perform the Shapiro-Wilk test for Normality
shapiro.test(bill_length_sp_lm$residuals)
Shapiro-Wilk normality test
data: bill_length_sp_lm$residuals
W = 0.99317, p-value = 0.123
Interpreting the Coefficients
- Now we’re happier with the model \(\rightarrow\) what do the coefficients mean?
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.919470977 1.0630034684 23.442511 7.632983e-73
speciesChinstrap 9.920884113 0.3510790185 28.258265 5.093822e-91
speciesGentoo 3.557977539 0.4857896978 7.324111 1.776921e-12
body_mass_g 0.003748497 0.0002823439 13.276352 1.158990e-32
. . .
- By default, the primary Intercept is the first factor level in species

levels(penguins$species)
[1] "Adelie"    "Chinstrap" "Gentoo"
Interpreting the Coefficients
- Now we’re happier with the model \(\rightarrow\) what do the coefficients mean?
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.919470977 1.0630034684 23.442511 7.632983e-73
speciesChinstrap 9.920884113 0.3510790185 28.258265 5.093822e-91
speciesGentoo 3.557977539 0.4857896978 7.324111 1.776921e-12
body_mass_g 0.003748497 0.0002823439 13.276352 1.158990e-32
- The baseline intercept is for Adelie penguins
- Additional intercept terms are the differences between the baseline and each species
- Does this check-out in our initial plot of the data?
- They all appear significant when checking each \(H_0\)
Adding Terms
- Do we think each species may have a different relationship between mass and bill length?
- Do we need to fit a separate slope for each species?
- This is done in R using an “interaction term”, separating terms by :
  - species:body_mass_g
bill_length_int_lm <- lm(
bill_length_mm ~ species + body_mass_g + species:body_mass_g,
data = penguins
)
summary(bill_length_int_lm)
- Now what do we think?
Interpreting the Coefficients
- The baseline terms for the Intercept and body_mass_g are now both for Adelie
- Differences in slope for each species are provided as speciesChinstrap:body_mass_g and speciesGentoo:body_mass_g
. . .
- The regression line for Adelie is y = 26.99 + 0.0032*body_mass_g

. . .

- For Chinstrap: y = (26.99 + 5.18) + (0.0032 + 0.0013)*body_mass_g
Interpreting the Coefficients
- The model matrix \(X\) is a key part of understanding statistics
- The model coefficients (\(\hat{\beta}\)) are actually estimated by performing linear algebra on this
- The least squares solution is \(\hat{\beta} = (X^TX)^{-1}X^Ty\)
. . .
y <- penguins$bill_length_mm[!is.na(penguins$body_mass_g)]
X <- model.matrix(~ species + body_mass_g + species:body_mass_g, data = penguins)
Interpreting the Coefficients
head(X)
  (Intercept) speciesChinstrap speciesGentoo body_mass_g
1           1                0             0        3750
2           1                0             0        3800
3           1                0             0        3250
5           1                0             0        3450
6           1                0             0        3650
7           1                0             0        3625
  speciesChinstrap:body_mass_g speciesGentoo:body_mass_g
1                            0                         0
2                            0                         0
3                            0                         0
5                            0                         0
6                            0                         0
7                            0                         0
Interpreting the Coefficients
## No need to save this, I'm just demonstrating a point
## The inverse of a matrix is found using `solve()`
## Matrix multiplication is performed using %*%
## The transpose of a matrix is found using `t()`
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
cbind(beta_hat, coef(bill_length_int_lm))
                                      [,1]          [,2]
(Intercept) 26.9941391366 26.9941391367
speciesChinstrap 5.1800537287 5.1800537287
speciesGentoo -0.2545906615 -0.2545906615
body_mass_g 0.0031878758 0.0031878758
speciesChinstrap:body_mass_g 0.0012748183 0.0012748183
speciesGentoo:body_mass_g 0.0009029956 0.0009029956
Interpreting the Coefficients
- An excellent primer on model matrices is here1
- Conventional statistics always sets a baseline level for the intercept
- Performs a \(t\)-test automatically for differences
- Really quite convenient if less obvious
- We could remove the common intercept using ~ 0 + species + body_mass_g
  - Would return a separate intercept by removing the common (baseline) level
  - No real advantage except non-statisticians like it
  - No statistical evidence provided for differences in intercepts \(\implies\) have to test ourselves 😢
Model Diagnostics
## Check the residuals after
## including a separate intercept
## and slope for species
par(mfrow = c(2, 2))
plot(bill_length_int_lm)
par(mfrow = c(1, 1))
Model Selection
- How do we decide on the best model?
- A common technique is Analysis of Variance (ANOVA)
- Classic ANOVA checks importance of each term within a model
## Run a separate ANOVA on each model
anova(bill_length_lm)
anova(bill_length_sp_lm)
anova(bill_length_int_lm)
. . .
- Does this give any clue as to the best model?
Model Selection
- We have progressively added terms to each model
- Can use ANOVA to compare suitability of each model
## Compare the models using ANOVA
## If a model is a significant improvement, the p-value from the
## F test will be clearly significant
anova(
bill_length_lm,
bill_length_sp_lm,
bill_length_int_lm
)
. . .
- It looks like the separate slopes are not an improvement
- The separate intercepts are an improvement
Speeding The Process Up
- This was a careful breakdown of finding the best model
- We can partially automate this and use some shortcuts
. . .
- The shorthand for an interaction term with separate intercepts is *
## Refit the model with an interaction term using the shorthand code
bill_length_int_lm <- lm(
bill_length_mm ~ species * body_mass_g, data = penguins
)
summary(bill_length_int_lm)
. . .
- Alternatively, all terms can be placed inside brackets & raised to a power
  - (species + body_mass_g)^2 would give two-way interactions
Speeding The Process Up
- The AIC is a function of the number of model terms and the log likelihood function
- Beyond the scope of today
- Doesn’t perform well with highly correlated predictors
- After specifying a ‘full’ model \(\implies\) use step() to remove redundant terms
  - Removes terms in a step-wise manner
- Uses Akaike’s Information Criterion (AIC) to determine optimal model
- Finds model with lowest AIC
step(bill_length_int_lm)
Objects of Class summary.lm
- The coefficients element of the basic lm object only had the fitted values
  - Not the std errors, t-tests or p-values

## Extract the coefficients directly from the linear model
bill_length_sp_lm$coefficients
. . .
- These values are produced by the function summary()

## Extract the coefficients from the summary object
summary(bill_length_sp_lm)$coefficients
Objects of Class summary.lm
## Check the class of the output from summary
summary(bill_length_sp_lm) |> class()
## Prove conclusively that it is really a list
## with a class attribute stuck on it
summary(bill_length_sp_lm) |> is.list()
## See what attributes it has
summary(bill_length_sp_lm) |> attributes()
## Check the full details
summary(bill_length_sp_lm) |> glimpse()
. . .
- The full complexity of the object is mostly irrelevant
Using A More Tidyverse Friendly Approach
- The function tidy() from the package broom is a catch-all function
  - Will return a tibble
  - Returns the same from lm and summary.lm objects

bill_length_sp_lm |> broom::tidy()
Adding Significance Stars
- The easiest way for me is a case_when()
bill_length_sp_lm |>
broom::tidy() |>
mutate(
stars = case_when(
p.value < 0.001 ~ "***",
p.value < 0.01 ~ "**",
p.value < 0.05 ~ "*",
p.value < 0.1 ~ ".",
TRUE ~ ""
)
  )
Or Fitting a Model On The Fly
- Obviously no room for checking model diagnostics
penguins |>
## Piped data can be recalled using `_`
lm( bill_length_mm ~ species * body_mass_g, data = _) |>
step() |> # Fit the best model using the AIC
broom::tidy() |> # Turn the output into a tibble & add stars
mutate(
stars = case_when(
p.value < 0.001 ~ "***",
p.value < 0.01 ~ "**",
p.value < 0.05 ~ "*",
p.value < 0.1 ~ ".",
TRUE ~ ""
)
  )
S3 Method Dispatch
S3 Method Dispatch
- Objects with class lm and summary.lm are S3 objects
- Very informal class structure in R
- Easy to work with \(\implies\) easy to break
. . .
- When we printed these objects \(\implies\) print.lm() or print.summary.lm()
- Likewise when we plotted the lm object \(\implies\) plot.lm()
. . .
?plot.lm
S3 Method Dispatch
- If we only want a subset of figures
par(mfrow = c(1, 2))
plot(
bill_length_sp_lm, which = c(1, 3),
caption = list(
"Separate Intercepts: Residuals vs Fitted", NULL,
"Separate Intercepts: Scale-Location"
)
)
par(mfrow = c(1, 1))
S3 Method Dispatch
- We can also check what we pass to broom::tidy()

?broom::tidy.lm
. . .
- broom::tidy() has multiple versions for objects of different classes
  - e.g. broom::tidy.prcomp() for PCA results
S3 Method Dispatch
- S3 objects can be easily broken

broken_lm <- bill_length_sp_lm
class(broken_lm) <- "htest"
broken_lm
. . .

- R looked for the print.htest method
- The structure didn’t match what was expected \(\implies\) nonsense output
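A minimal sketch of how this dispatch works: class() is just an attribute, and print() looks for a function named print.&lt;class&gt;. The class name my_class and its fields are made up for illustration:

```r
## Define a print method for a made-up class
print.my_class <- function(x, ...) {
  cat("Estimate:", x$estimate, "from", x$n, "observations\n")
  invisible(x)
}

## Any list can be given that class; print() then dispatches to our method
my_result <- structure(list(estimate = 3.14, n = 10), class = "my_class")
print(my_result)  # prints "Estimate: 3.14 from 10 observations"
```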
Confidence Intervals
- For confidence or prediction intervals, we can use predict.lm()
- 95% Confidence Intervals: The mean will be in the given interval 95% of the time
- 95% Prediction Intervals: An observation will be in the given interval 95% of the time
- We will need a new data frame to make the predictions about
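The same idea sketched with the built-in cars dataset, so it runs without the penguins data: a prediction interval is always wider than the confidence interval at the same level:

```r
## Predict stopping distance for a new speed of 15 mph
fit <- lm(dist ~ speed, data = cars)
new_data <- data.frame(speed = 15)
predict(fit, newdata = new_data, interval = "confidence")  # CI for the mean
predict(fit, newdata = new_data, interval = "prediction")  # wider PI
```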
Confidence Intervals
## Create some new penguins
new_penguins <- tibble(
species = c("Adelie", "Gentoo"),
body_mass_g = 4500
)
## Predict their mean bill length
predict(bill_length_sp_lm, newdata = new_penguins, interval = "confidence")
## Predict the range a new observation may lie in
predict(bill_length_sp_lm, newdata = new_penguins, interval = "prediction")
Confidence Intervals
- We may wish to include our new penguins in these results
## Predict their mean bill length, but include the original data
new_penguins |>
bind_cols(
predict(bill_length_sp_lm, newdata = new_penguins, interval = "confidence")
  )
Confidence Intervals
- Confidence Intervals for model terms can also be found using
broom::tidy()
## Use broom:tidy to find confidence intervals for coefficients
bill_length_sp_lm |>
  broom::tidy(conf.int = TRUE)
Closing Comments
Additional Plotting Comments
- I rarely show diagnostic plots:
  - They’re for me when fitting data & performing analysis
  - No need for ggplot versions
tibble(
  fitted = fitted(bill_length_sp_lm),
residuals = resid(bill_length_sp_lm)
) |>
ggplot(aes(fitted, residuals)) +
geom_point() +
geom_smooth(
se = FALSE, colour = 'red',
linewidth = 0.5
) +
ggtitle("Residuals vs Fitted") +
labs(
x = "Fitted Values", y = "Residuals"
  )

tibble(
residuals = rstandard(bill_length_sp_lm)
) |>
ggplot(aes(sample = residuals)) +
geom_qq() +
geom_qq_line() +
ggtitle("Q-Q Residuals") +
labs(
x = "Theoretical Quantiles",
y = "Standardized Residuals"
  )
Additional Plotting Packages
- Multiple options for great looking plots
  - ggpmisc for adding correlations, regression equations etc.
  - ggstats for multiple fresh perspectives on coefficients
  - ggstatsplot also has a wide range of plotting capabilities
Challenges
- How could we account for sex in the existing penguins model?
- Load the pigs dataset
- Make a boxplot:
  - len will be the response (\(y\))
  - Place dose on the \(x\)-axis, using supp to fill the boxes
- Find the best model for len, using supp and dose as the predictors
- Check the residuals and diagnostic plots
- Make sure you can interpret the model coefficients
- What is the 99% confidence interval for supp = "VC" & dose = "High"?
  - What does this mean?
References
Footnotes
https://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/designmatrices.html↩︎